1 Introduction

I am very interested in visualizations, especially geospatial (geographic) visualization! It is all the better if the visualization is interactive.

For my Data Incubator fellowship application challenge, I tried to create a spatial animation of daily confirmed COVID-19 cases and deaths in Indiana, as reported by Johns Hopkins University’s interactive website (Dong, Du, and Gardner 2020).

1.1 Motivation

One of my persistent annoyances (and frustrations) is the mismatch between how commonly available spatial data is organized and how it needs to be organized for analytics. The central issue is this: most spatial data is stored in wide1 format, whereas most analytics software needs the data in long format. And that is not to mention all the minor details (esoteric spatial file formats, missing/invalid data, R’s automatic type conversion, etc.) that must be handled to successfully join spatial data with attribute data before you can do any analytics. Lest you think I have been living under a rock: I am quite aware, from my own past experience and from reading articles and blogs similar to this NYT article, that data munging can take surprisingly more time than anticipated. But despite my better judgement I thought: how hard can this really be? Right away you know that you are reading the words of a novice data scientist!

2 Data and methods

There are two data components needed to achieve my objective: 1) COVID-19 time series data, and 2) geographic boundary data.

## load the libraries to get started...
library("data.table")
library("sf")
library("ggplot2")

2.1 COVID-19 time series data

Johns Hopkins University (JHU) provides the data in CSV format at https://github.com/CSSEGISandData/COVID-19. Exploring the folders, I surmised that I needed to get the time series data. So, I downloaded the CSV files time_series_covid19_confirmed_US.csv and time_series_covid19_deaths_US.csv. Some of you might be wondering: why the heck didn’t you just clone the repository? to which I reply: sorry, I was being naive.2

Let’s read the data and see what we have.

confirmed <- fread("time_series_covid19_confirmed_US.csv", colClasses="character")
dim(confirmed)
## [1] 3261  172
confirmed[1:7, 1:15]
##         UID iso2 iso3 code3   FIPS  Admin2           Province_State
## 1:       16   AS  ASM    16   60.0                   American Samoa
## 2:      316   GU  GUM   316   66.0                             Guam
## 3:      580   MP  MNP   580   69.0         Northern Mariana Islands
## 4:      630   PR  PRI   630   72.0                      Puerto Rico
## 5:      850   VI  VIR   850   78.0                   Virgin Islands
## 6: 84001001   US  USA   840 1001.0 Autauga                  Alabama
## 7: 84001003   US  USA   840 1003.0 Baldwin                  Alabama
##    Country_Region         Lat        Long_                 Combined_Key 1/22/20
## 1:             US     -14.271     -170.132           American Samoa, US       0
## 2:             US     13.4443     144.7937                     Guam, US       0
## 3:             US     15.0979     145.6739 Northern Mariana Islands, US       0
## 4:             US     18.2208     -66.5901              Puerto Rico, US       0
## 5:             US     18.3358     -64.8963           Virgin Islands, US       0
## 6:             US 32.53952745 -86.64408227         Autauga, Alabama, US       0
## 7:             US 30.72774991 -87.72207058         Baldwin, Alabama, US       0
##    1/23/20 1/24/20 1/25/20
## 1:       0       0       0
## 2:       0       0       0
## 3:       0       0       0
## 4:       0       0       0
## 5:       0       0       0
## 6:       0       0       0
## 7:       0       0       0
head(colnames(confirmed), 20)
##  [1] "UID"            "iso2"           "iso3"           "code3"         
##  [5] "FIPS"           "Admin2"         "Province_State" "Country_Region"
##  [9] "Lat"            "Long_"          "Combined_Key"   "1/22/20"       
## [13] "1/23/20"        "1/24/20"        "1/25/20"        "1/26/20"       
## [17] "1/27/20"        "1/28/20"        "1/29/20"        "1/30/20"
deaths <- fread("time_series_covid19_deaths_US.csv", colClasses="character")
dim(deaths)
## [1] 3261  173
deaths[1:7, 1:15]
##         UID iso2 iso3 code3  FIPS  Admin2           Province_State
## 1:       16   AS  ASM    16    60                   American Samoa
## 2:      316   GU  GUM   316    66                             Guam
## 3:      580   MP  MNP   580    69         Northern Mariana Islands
## 4:      630   PR  PRI   630    72                      Puerto Rico
## 5:      850   VI  VIR   850    78                   Virgin Islands
## 6: 84001001   US  USA   840 01001 Autauga                  Alabama
## 7: 84001003   US  USA   840 01003 Baldwin                  Alabama
##    Country_Region         Lat        Long_                 Combined_Key
## 1:             US     -14.271     -170.132           American Samoa, US
## 2:             US     13.4443     144.7937                     Guam, US
## 3:             US     15.0979     145.6739 Northern Mariana Islands, US
## 4:             US     18.2208     -66.5901              Puerto Rico, US
## 5:             US     18.3358     -64.8963           Virgin Islands, US
## 6:             US 32.53952745 -86.64408227         Autauga, Alabama, US
## 7:             US 30.72774991 -87.72207058         Baldwin, Alabama, US
##    Population 1/22/20 1/23/20 1/24/20
## 1:      55641       0       0       0
## 2:     164229       0       0       0
## 3:      55144       0       0       0
## 4:    2933408       0       0       0
## 5:     107268       0       0       0
## 6:      55869       0       0       0
## 7:     223234       0       0       0
head(colnames(deaths), 20)
##  [1] "UID"            "iso2"           "iso3"           "code3"         
##  [5] "FIPS"           "Admin2"         "Province_State" "Country_Region"
##  [9] "Lat"            "Long_"          "Combined_Key"   "Population"    
## [13] "1/22/20"        "1/23/20"        "1/24/20"        "1/25/20"       
## [17] "1/26/20"        "1/27/20"        "1/28/20"        "1/29/20"

As is evident, this data is organized in a wide format: rows correspond to US counties, and the columns correspond to county attributes plus one column for each date, starting 2020-01-22, when JHU started tracking the data. Looking at the dimensions, especially the number of columns, it was clear I needed to reshape the dataset before joining it with the spatial data.

The FIPS column needs fixing because time_series_covid19_confirmed_US.csv stores it as a numeric with a trailing decimal 0 (see row 6 above)! In fact, I spent more than four hours chasing data type issues3 before finally deciding to read all columns as character vectors and do the necessary type conversion myself. I began by fixing FIPS, which is the key column needed for joining with the spatial data.

confirmed[, FIPS:=as.character(as.integer(FIPS))]
confirmed[nchar(FIPS)==4, FIPS:=sprintf("0%s",FIPS)]

deaths[, FIPS:=as.character(as.integer(FIPS))]
deaths[nchar(FIPS)==4, FIPS:=sprintf("0%s",FIPS)]
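The two-step cleanup above can also be sketched as a one-step zero-pad with sprintf("%05d", ...). This is only an illustration on made-up values; note that, unlike the two-step version, it pads everything to five characters, including the territory-level codes that the original code leaves at their native width.

```r
## Toy version of the FIPS cleanup: strings stored as numerics with a
## trailing ".0" become zero-padded five-character codes in one step.
fips_raw <- c("1001.0", "18097.0", "60.0")
fips <- sprintf("%05d", as.integer(fips_raw))
stopifnot(identical(fips, c("01001", "18097", "00060")))
```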

2.2 Geographic boundary data

Anyone who has ever worked with any GIS software has inevitably had to deal with shapefiles. While they might have been acceptable at the time the format was created, they are not well suited to modern workflows. The GIS community, at least the open source geospatial community, recognizes these shortcomings and is coming up with alternative formats such as GeoPackage and GeoJSON. But visit any federal, state, or local government website and you are very likely to run into spatial data provided only as a shapefile, usually offered in a zip archive. There are many issues with shapefiles (just check out http://switchfromshapefile.org/)…but I know how to deal with all of them…just convert to a GeoPackage! So, I downloaded the US county shapefile (cb_2018_us_county_500k.zip)4 from the US Census Bureau’s Cartographic Boundary Files - Shapefile page and converted it to a GeoPackage.

## d <- st_read("cb_2018_us_county_500k.shp") 
## st_write(d, "counties.gpkg") 
## rm(d)
(counties_geo <- st_read("counties.gpkg", quiet=TRUE))
## Simple feature collection with 3233 features and 9 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -179.1489 ymin: -14.5487 xmax: 179.7785 ymax: 71.36516
## geographic CRS: NAD83
## First 10 features:
##    STATEFP COUNTYFP COUNTYNS       AFFGEOID GEOID    NAME LSAD      ALAND
## 1       21      007 00516850 0500000US21007 21007 Ballard   06  639387454
## 2       21      017 00516855 0500000US21017 21017 Bourbon   06  750439351
## 3       21      031 00516862 0500000US21031 21031  Butler   06 1103571974
## 4       21      065 00516879 0500000US21065 21065  Estill   06  655509930
## 5       21      069 00516881 0500000US21069 21069 Fleming   06  902727151
## 6       21      093 00516893 0500000US21093 21093  Hardin   06 1614569777
## 7       21      099 00516896 0500000US21099 21099    Hart   06 1068530028
## 8       21      131 00516912 0500000US21131 21131  Leslie   06 1038206077
## 9       21      151 00516919 0500000US21151 21151 Madison   06 1132729653
## 10      21      155 00516921 0500000US21155 21155  Marion   06  888463701
##      AWATER                           geom
## 1  69473325 MULTIPOLYGON (((-89.18137 3...
## 2   4829777 MULTIPOLYGON (((-84.44266 3...
## 3  13943044 MULTIPOLYGON (((-86.94486 3...
## 4   6516335 MULTIPOLYGON (((-84.12662 3...
## 5   7182793 MULTIPOLYGON (((-83.98428 3...
## 6  17463238 MULTIPOLYGON (((-86.27756 3...
## 7  13692536 MULTIPOLYGON (((-86.16112 3...
## 8   9189732 MULTIPOLYGON (((-83.5531 37...
## 9  15306635 MULTIPOLYGON (((-84.52564 3...
## 10  9891797 MULTIPOLYGON (((-85.52129 3...

Some things to notice are:

  1. The geospatial data has only 3233 counties whereas the time series data has 3261 rows! That is because the time series data has some extra rows for Grand Princess, Federal Corrections (MDOC), etc., which are irrelevant for my purposes.
  2. The GEOID column is formed by concatenating the STATEFP and COUNTYFP columns. You will also note that table(nchar(counties_geo$GEOID)) shows that all 3233 GEOIDs are five characters wide, which is why we had to ‘0’-prefix some of our FIPS values earlier.
  3. Finally, it appears that the GEOID and FIPS columns can be used to join/relate the attribute and spatial tables.
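The GEOID observation can be checked mechanically. Here is a sketch on a toy data.frame standing in for counties_geo (the two sample counties are illustrative):

```r
## GEOID should equal STATEFP followed by COUNTYFP and always be
## five characters wide.
geo <- data.frame(STATEFP  = c("18", "21"),
                  COUNTYFP = c("097", "007"),
                  GEOID    = c("18097", "21007"),
                  stringsAsFactors = FALSE)
stopifnot(identical(paste0(geo$STATEFP, geo$COUNTYFP), geo$GEOID))
stopifnot(all(nchar(geo$GEOID) == 5L))
```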

2.3 Transforming data

While we can definitely use the tabular data with the geospatial data “as is”, there are a few things that we may want to consider:

  1. Since this time series is updated daily, there will be more columns as newer data are added.

  2. Filtering, especially by date range, is going to involve selecting different columns. For instance, exploring cases/deaths from April 1, 2020 to June 15, 2020 will require figuring out which columns to select. And then, a little while later, if we wish to explore cases/deaths from April 15, 2020 to June 30, 2020, we will again have to figure out which columns to select.

    Contrast that with storing this data in a long format, where there are three columns: one for the date, another for confirmed cases, and a third for deaths. Now selecting by date range is a simple matter of row filtering, which can be done with R’s built-in indexing or other data munging packages (data.table, dplyr, tidyverse, reshape).

  3. Determining a date (or date range) based on some combination of values (confirmed cases, deaths, or some transformation of them) is going to be rather tedious.
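The row-filtering point can be sketched in base R on a toy long-format table (the column names and dates here are illustrative, not the real JHU data):

```r
## Toy long-format table: a date-range filter is just row subsetting.
long <- data.frame(date      = as.Date("2020-04-01") + 0:90,  # Apr 1 .. Jun 30
                   confirmed = 0:90)
sel <- subset(long,
              date >= as.Date("2020-04-15") & date <= as.Date("2020-06-30"))
stopifnot(nrow(sel) == 77L)  # Apr 15 .. Jun 30, inclusive
```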

Hence, it would be good to convert this to the long format. By the way, this is the same way many analytics-oriented applications, R included, view the data: instead of thinking of a table as a collection of rows, we should think of it as a collection of columns, each of which is a vector of values! This is why asking R for a data.frame’s length (for instance, length(mtcars)) yields the number of columns (i.e., dim(mtcars)[[2L]], which is 11). This view of the data is called an “inverted table” (Hui and Kromberg 2020, pg. 11) and is the predominant view in columnar databases and analytic processing systems! Wide-to-long conversion is commonly called a “melt” operation in R parlance, so let us melt our data.

deaths.id.vars <- 1:12 ## head(colnames(deaths), 14) ... just to verify
confirmed.id.vars <- 1:11 ## head(colnames(confirmed), 13)

deaths.long <- melt(deaths, id.vars=colnames(deaths)[deaths.id.vars], 
                    measure.vars=colnames(deaths)[-deaths.id.vars], 
                    variable.name="date1", value.name="deaths")
confirmed.long <- melt(confirmed, id.vars=colnames(confirmed)[confirmed.id.vars], 
                       measure.vars=colnames(confirmed)[-confirmed.id.vars], 
                       variable.name="date1", value.name="confirmed_cases")

stopifnot(nrow(deaths) * (ncol(deaths)-length(deaths.id.vars)) == 
          nrow(deaths.long))
stopifnot(nrow(confirmed) * (ncol(confirmed)-length(confirmed.id.vars)) == 
          nrow(confirmed.long))

## keep the original date1 column in case the date parsing messes up.
deaths.long[, `:=`(ddate1 = strftime(as.Date(as.character(date1), format="%m/%d/%y"), 
                                     format="%Y-%m-%d")
                  ,deaths=as.integer(deaths) )]
confirmed.long[, `:=`(ddate1=strftime(as.Date(as.character(date1), format="%m/%d/%y"), 
                                      format="%Y-%m-%d")
                     ,confirmed_cases = as.integer(confirmed_cases))]

We melt the data using the first few columns as id variables and all the date columns as measure variables. It is a good idea to use stopifnot statements to ensure that the transformed data conforms to what we expect. Starting with R version 4.0, stopifnot also accepts custom error messages via argument names, which makes argument checking easier.
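A minimal illustration of that R 4.0 feature (the variable x and the message text are made up for the example):

```r
## R >= 4.0: a name on a stopifnot() argument becomes the error message.
x <- 5L
msg <- tryCatch(
  stopifnot("x must be even" = x %% 2L == 0L),
  error = function(e) conditionMessage(e))
stopifnot(identical(msg, "x must be even"))
```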

2.4 Plotting the data

Now that we have our data, let us try to plot it. While I wanted to create an animation for all the counties in the US, I ran into lots of issues. So, I decided to do a smaller example with just the Indiana counties.

IN_geo <- counties_geo[grepl("^18", counties_geo$GEOID),]
deaths_IN <- deaths.long[grepl("^18", FIPS),]
confirmed_IN <- confirmed.long[grepl("^18", FIPS), ]

deaths_20200627 <- deaths_IN[ddate1=="2020-06-27", ]
cases_20200627 <- confirmed_IN[ddate1=="2020-06-27", ]

par(mfrow=c(1,2))
d <- merge(IN_geo, deaths_20200627, by.x="GEOID", by.y="FIPS")
plot(d[, "deaths"], main="Deaths 2020-06-27", breaks='kmeans')
d <- merge(IN_geo, cases_20200627, by.x="GEOID", by.y="FIPS")
plot(d[, "confirmed_cases"], main="Confirmed cases 2020-06-27", breaks='kmeans')

Now that we know how to create a plot let’s try to create a whole bunch of images which we can combine into an animation.

outdir <- "figs"
dts <- unique(deaths_IN$ddate1)

## for(d in dts[155:160]) {
for(d in dts) {
    deaths_png <- file.path(outdir, sprintf("deaths_%s.png", d))
    cases_png <- file.path(outdir, sprintf("cases_%s.png", d))

    dd <- deaths_IN[ddate1==d, .(FIPS, deaths, ddate1)]
    cc <- confirmed_IN[ddate1==d,.(FIPS, confirmed_cases, ddate1)]
    ii <- IN_geo[, c("NAME", "GEOID", "geom")]
    dat <- merge(ii, cc, by.x="GEOID", by.y="FIPS")
    dat$confirmed_cases <- dat$confirmed_cases/sum(dat$confirmed_cases)
    if(!file.exists(cases_png)) {
      png(cases_png)
      ## plot(dat[, "confirmed_cases"], main=d, key.pos=NULL, border=0L, breaks="kmeans")
      plot(dat[, "confirmed_cases"], main=d, breaks="kmeans")
      dev.off()
    }

    dat <- merge(ii, dd, by.x="GEOID", by.y="FIPS")
    dat$deaths <- dat$deaths/sum(dat$deaths)
    if(!file.exists(deaths_png)) {
      png(deaths_png)
      ## plot(dat["deaths"], main=d, key.pos=NULL, border=0L, breaks="kmeans")
      plot(dat["deaths"], main=d, breaks="kmeans")
      dev.off()
    }
}

And, finally, we can combine all of these PNGs into a GIF with ImageMagick, using something like convert figs/cases*.png cases.gif. Here is the very rudimentary result of my unsuccessful attempt.
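A slightly fuller sketch of that step, with a frame delay and looping (the flag values are illustrative; the guard makes it a no-op when ImageMagick or the input PNGs are missing):

```shell
# -delay is in hundredths of a second; -loop 0 repeats forever.
make_gif() {
  pattern=$1; out=$2
  if command -v convert >/dev/null 2>&1 && ls $pattern >/dev/null 2>&1; then
    convert -delay 20 -loop 0 $pattern "$out"
  else
    echo "skipping $out (ImageMagick or input PNGs not available)"
  fi
}
make_gif "figs/cases_*.png"  cases.gif
make_gif "figs/deaths_*.png" deaths.gif
```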

Number of confirmed cases from 2020-01-22 to 2020-06-30

3 Lessons learned

  1. Working with geospatial data is a lot trickier than anticipated. I tend to forget this fact quite often and have to make a conscious, and regular, effort to keep this in mind.

  2. Use git and make more often so that I become familiar enough with these tools to use them unconsciously.

  3. Try to always remember:

    It always takes longer than you expect, even if you take Hofstadter’s Law into account. – Douglas Hofstadter

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.3.2     sf_1.0-0          data.table_1.12.8
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5         compiler_4.0.2     pillar_1.4.4       class_7.3-17      
##  [5] tools_4.0.2        digest_0.6.25      evaluate_0.14      lifecycle_0.2.0   
##  [9] tibble_3.0.2       gtable_0.3.0       pkgconfig_2.0.3    rlang_0.4.6       
## [13] DBI_1.1.0          yaml_2.2.1         xfun_0.15          e1071_1.7-3       
## [17] withr_2.2.0        dplyr_1.0.0        stringr_1.4.0      knitr_1.29        
## [21] generics_0.0.2     vctrs_0.3.1        classInt_0.4-3     grid_4.0.2        
## [25] tidyselect_1.1.0   glue_1.4.1         R6_2.4.1           rmarkdown_2.3     
## [29] purrr_0.3.4        magrittr_1.5       scales_1.1.1       htmltools_0.5.0   
## [33] ellipsis_0.3.1     units_0.6-7        colorspace_1.4-1   KernSmooth_2.23-17
## [37] stringi_1.4.6      munsell_0.5.0      crayon_1.3.4

References

Dong, Ensheng, Hongru Du, and Lauren Gardner. 2020. “An Interactive Web-Based Dashboard to Track COVID-19 in Real Time.” The Lancet Infectious Diseases 20 (5): 533–34. https://doi.org/10.1016/S1473-3099(20)30120-1.

Hui, Roger K. W., and Morten J. Kromberg. 2020. “APL Since 1978.” Proceedings of the ACM on Programming Languages 4 (HOPL): 1–108. https://doi.org/10.1145/3386319.


  1. https://en.wikipedia.org/wiki/Wide_and_narrow_data↩︎

  2. Firstly, I rejoice that I can even comprehend what you mean by this question. Secondly, as far as I’m aware, git is not an integral part of most GIS workflows (GIS being my own area of comfort). But, most important of all, I have not yet internalized and integrated git comprehensively into my own workflow!↩︎

  3. I could never figure out whether it was R’s automatic data conversion or my heavy customization that was causing all these issues. I tried running R --vanilla which only increased my problems…sigh!↩︎

  4. Since I was mostly interested in getting the data into R, it did not matter whether I chose shapefile or geodatabase. I would definitely have chosen geodatabase if the data size had exceeded shapefile’s limitations.↩︎